Abstract

We present the discriminative recurrent sparse auto-encoder model, comprising a 
recurrent encoder of rectified linear units, unrolled for a fixed number of itera- 
tions, and connected to two linear decoders that reconstruct the input and predict 
its supervised classification. Training via backpropagation-through-time initially 
minimizes an unsupervised sparse reconstruction error; the loss function is then 
augmented with a discriminative term on the supervised classification. The depth 
implicit in the temporally-unrolled form allows the system to exhibit far more 
representational power, while keeping the number of trainable parameters fixed. 
From an initially unstructured network the hidden units differentiate into 
categorical-units, each of which represents an input prototype with a well-defined 
class; and part-units representing deformations of these prototypes. The learned 
organization of the recurrent encoder is hierarchical: part-units are driven di- 
rectly by the input, whereas the activity of categorical-units builds up over time 
through interactions with the part-units. Even using a small number of hidden 
units per layer, discriminative recurrent sparse auto-encoders achieve excellent 
performance on MNIST. 



1 Introduction 



Deep networks complement the hierarchical structure in natural data (Bengio, 2009). By breaking
complex calculations into many steps, deep networks can gradually build up complicated decision
boundaries or input transformations, facilitate the reuse of common substructure, and explicitly
compare alternative interpretations of ambiguous input (Lee, Ekanadham, & Ng, 2008; Zeiler, Taylor,
& Fergus, 2011). Leveraging these strengths, deep networks have facilitated significant advances
in solving sensory problems like visual classification and speech recognition (Dahl, et al.;
Hinton, Osindero, & Teh, 2006; Hinton, et al., 2012).

Although deep networks have traditionally used independent parameters for each layer, they are 
equivalent to recurrent networks in which a disjoint set of units is active on each time step. The 
corresponding representations are sparse, and thus invite the incorporation of powerful techniques 
from sparse coding (Glorot, Bordes, & Bengio, 2011; Lee, Ekanadham, & Ng, 2008; Olshausen &
Field, 1996, 1997; Ranzato, et al., 2006). Recurrence opens the possibility of sharing parameters
between successive layers of a deep network. 

This paper introduces the Discriminative Recurrent Sparse Auto-Encoder model (DrSAE), comprising
a recurrent encoder of rectified linear units (ReLU; Coates & Ng, 2011; Glorot, Bordes, &
Bengio, 2011; Jarrett, et al., 2009; Nair & Hinton, 2010; Salinas & Abbott, 1996), connected to two
linear decoders that reconstruct the input and predict its supervised classification. The recurrent
encoder is unrolled in time for a fixed number of iterations, with the input projecting to each
resulting layer, and trained using backpropagation-through-time (Rumelhart, et al., 1986). Training
initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented
with a discriminative term on the supervised classification. In its temporally-unrolled form, the
network can be seen as a deep network, with parameters shared between the hidden layers. The temporal
depth allows the system to exhibit far more representational power, while keeping the number of
trainable parameters fixed.

Interestingly, experiments show that DrSAE does not just discover more discriminative "parts" of 
the form conventionally produced by sparse coding. Rather, the hidden units spontaneously dif- 
ferentiate into two types: a small number of categorical-units and a larger number of part-units. 
The categorical-units have decoder bases that look like prototypes of the input classes. They are 
weakly influenced by the input and activate late in the dynamics as the result of interaction with the 
part-units. In contrast, the part-units are strongly influenced by the input, and encode small trans- 
formations through which the prototypes of categorical-units can be reshaped into the current input. 
Categorical-units compete with each other through mutual inhibition and cooperate with relevant 
part-units. This can be interpreted as a representation of the data manifold in which the categorical- 
units are points on the manifold, and the part-units are akin to tangent vectors along the manifold. 



1.1 Prior work 



The encoder architecture of DrSAE is modeled after the Iterative Shrinkage and Threshold Algorithm
(ISTA), a proximal method for sparse coding (Chambolle, et al., 1998; Daubechies, Defrise, &
De Mol, 2004). Gregor & LeCun (2010) showed that the sparse representations computed by ISTA
can be efficiently approximated by a structurally similar encoder with a less restrictive, learned
parameterization. Rather than learn to approximate a precomputed optimal sparse code, the LISTA
autoencoders of Sprechmann, Bronstein, & Sapiro (2012a,b) are trained to directly minimize the sparse
reconstruction loss function. DrSAE extends LISTA autoencoders with a non-negativity constraint,
which converts the shrink nonlinearity of LISTA into a rectified linear operator; and introduces a
unified classification loss, as previously used in conjunction with traditional sparse coders
(Bradley & Bagnell, 2008; Mairal, et al., 2009; Mairal, Bach, & Ponce, 2012) and other autoencoders
(Boureau, et al., 2010; Ranzato & Szummer, 2008).



DrSAEs resemble the structure of deep sparse rectifier neural networks (Glorot, Bordes, & Bengio,
2011), but differ in that the parameter matrices at each layer are tied (Bengio, Boulanger-
Lewandowski, & Pascanu, 2012), the input projects to all layers, and the outputs are normalized.
DrSAEs are also reminiscent of the recurrent neural networks investigated by Bengio & Gingras
(1996), but use a different nonlinearity and a heavily regularized loss function. Finally, they are
similar to the recurrent networks described by Seung (1998), but have recurrent connections amongst
the hidden units, rather than between the hidden units and the input units, and introduce
classification and sparsification losses.



2 Network architecture 

In the following, we use lower-case bold letters to denote vectors, upper-case bold letters to denote
matrices, superscripts to indicate iterative copies of a vector, and subscripts to index the columns
(or rows, if explicitly specified by the context) of a matrix or (without boldface) the elements of a
vector. We consider discriminative recurrent sparse auto-encoders (DrSAEs) of rectified linear units
with the architecture shown in figure 1:

$$ z^{t+1} = \max\left(0,\; E \cdot x + S \cdot z^{t} - b\right) \qquad (1) $$

for t = 1, ..., T, where the n-dimensional vector $z^t$ is the activity of the hidden units at
iteration t, the m-dimensional vector x is the input, and $z^{t=0} = 0$. Unlike traditional recurrent
autoencoders (Bengio, Boulanger-Lewandowski, & Pascanu, 2012), the input projects to every iteration.
We call the n × m parameter matrix E the encoding matrix, and the n × n parameter matrix S the
explaining-away matrix. The n-element parameter vector b contains a bias term. The parameters also
include the m × n decoding matrix D and the l × n classification matrix C.
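The recurrence of equation 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `drsae_encode`, the toy dimensions, and the random parameter values are all assumptions made for the example.

```python
import numpy as np

def drsae_encode(x, E, S, b, T=11):
    """Unrolled encoder of equation 1: z <- max(0, E x + S z - b), starting from z = 0."""
    z = np.zeros(E.shape[0])
    drive = E @ x - b                 # the input projects identically to every iteration
    for _ in range(T):
        z = np.maximum(0.0, drive + S @ z)
    return z

# Hypothetical toy problem (sizes and values are illustrative only).
rng = np.random.default_rng(0)
n, m = 8, 5                           # hidden units, input dimensions
E = 0.1 * rng.standard_normal((n, m))
S = 0.1 * rng.standard_normal((n, n))
b = np.full(n, 0.01)
x = rng.standard_normal(m)
x /= np.linalg.norm(x)                # inputs are normalized to unit L2 magnitude
z_T = drsae_encode(x, E, S, b, T=11)
```

Because the hidden state starts at zero, the first pass through S is trivial, so T = 11 iterations correspond to ten nontrivial applications of the explaining-away matrix.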

We pretrain DrSAEs using stochastic gradient descent on the unsupervised loss function 

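$$ L^{U} = \tfrac{1}{2} \left\lVert x - D \cdot z^{T} \right\rVert_2^2 + \lambda \cdot \left\lVert z^{T} \right\rVert_1 , \qquad (2) $$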
Figure 1: The discriminative recurrent sparse auto-encoder (DrSAE) architecture. $z^t$ is the hidden
representation after iteration t of T, and is initialized to $z^0 = 0$; x is the input; and y is the
supervised classification. Overbars denote approximations produced by the network, rather than the
true input. E, S, D, and b are learned parameters.

with the magnitude of the columns of D bounded by 1,¹ and the magnitude of the rows of E bounded
by 1.25/T.² We then add in the supervised classification loss function



$$ L^{S} = \mathrm{logistic}_y\left( C \cdot \frac{z^{T}}{\lVert z^{T} \rVert} \right) , \qquad (3) $$



where the multinomial logistic loss function is defined by 



$$ \mathrm{logistic}_y(z) = -\log\left( \frac{e^{z_y}}{\sum_i e^{z_i}} \right) $$



and y is the index of the desired class.³ Starting with the parameters learned by the unsupervised
pretraining, we perform discriminative fine-tuning by stochastic gradient descent on $L^U + L^S$, with
the magnitude of the rows of C bounded by 5.⁴ The learning rate of each matrix is scaled down by
the number of times it is repeated in the network, and the learning rate of the classification matrix
is scaled down by a factor of 5, to keep the effective learning rate consistent amongst the parameter
matrices.
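The two loss terms and the norm constraints described above can be sketched as follows. This is a hedged illustration rather than the authors' code: all function names are hypothetical, the normalization of the final hidden state before classification is an assumption based on the statement that the outputs are normalized, and the norm bounds are implemented as a rescaling projection onto an L2 ball, consistent with the note that constraints are enforced by a projection after each SGD step.

```python
import numpy as np

def unsupervised_loss(x, D, z, lam):
    """Sparse reconstruction loss: 0.5 * ||x - D z||_2^2 + lam * ||z||_1."""
    r = x - D @ z
    return 0.5 * (r @ r) + lam * np.abs(z).sum()

def multinomial_logistic_loss(scores, y):
    """Softmax cross-entropy for the desired class index y (numerically stabilized)."""
    s = scores - scores.max()
    return np.log(np.exp(s).sum()) - s[y]

def supervised_loss(z, C, y, eps=1e-12):
    """Classification loss on the normalized final hidden state (normalization is an assumption)."""
    z_hat = z / max(np.linalg.norm(z), eps)
    return multinomial_logistic_loss(C @ z_hat, y)

def project_columns(M, bound):
    """Rescale every column whose L2 norm exceeds `bound` back to norm `bound`."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    return M * np.minimum(1.0, bound / np.maximum(norms, 1e-12))

def project_rows(M, bound):
    """Row-wise counterpart of project_columns."""
    return project_columns(M.T, bound).T
```

In a training loop, one would take an SGD step on the chosen loss and then apply `project_columns` to D and `project_rows` to E and C with their respective bounds.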

We train DrSAEs with T = 11 recurrent iterations (ten nontrivial passes through the
explaining-away matrix S)⁵ and 400 hidden units on the MNIST dataset of 28 × 28 grayscale
handwritten digits (LeCun, et al., 1998), with each input normalized to have ℓ2 magnitude equal to 1.
We use a training set of 50,000 elements, and a validation set of 10,000 elements to perform
early-stopping. Encoding, decoding, and classification matrices learned via this procedure are
depicted in figure 2.

The dynamics of equation 1 are inspired by the Learned Iterative Shrinkage and Thresholding
Algorithm (LISTA) (Gregor & LeCun, 2010), an efficient approximation to the sparse coding Iterative
Shrinkage and Threshold Algorithm (ISTA) (Chambolle, et al., 1998; Daubechies, Defrise, & De



¹This sets the scale of z; otherwise, the magnitude of z will shrink to zero and the magnitude of the
columns of D will explode. This and all other such constraints are enforced by a projection after each
SGD step.

²The size of each ISTA step must be sufficiently small to guarantee convergence. As the step size grows
large, the input will be over-explained by multiple aligned hidden units, leading to extreme
oscillations. This bound serves the same function as ℓ2 weight regularization (Hinton, 2010). The
particular value of the bound is heuristic, and was determined by an informal search of parameter space.

³Consistent with standard autoencoders but unlike traditional applications of
backpropagation-through-time, the loss functions $L^U$ and $L^S$ only depend directly on the final
iteration of the hidden units $z^T$.

⁴As in the case of the encoder, this serves the same function as ℓ2 weight regularization (Hinton,
2010). The particular value of the bound is heuristic, and was determined by an informal search of
parameter space.

⁵The chosen number of recurrent iterations achieves a heuristic balance between representational power
and computational expense. Experiments were conducted with T ∈ {2, 6, 11, 21}.
Figure 2: The hidden units differentiate into spatially localized part-units, which have well-aligned
encoders and decoders; and global prototype categorical-units, which have poorly aligned encoders
and decoders. A subset of the rows of encoding matrix E (a) and the columns of decoding
matrix D (b), and all rows of the classification matrix C (c) after training. The first row of (a,b)
shows the most categorical units; the last row contains the least categorical units; and the middle
row evenly steps through the remaining units in order of categoricalness. Gray pixels denote
connections with weight 0; darker pixels indicate more positive connections.



Mol, 2004). ISTA is an algorithm for minimizing the ℓ1-regularized reconstruction loss function $L^U$
of equation 2 with respect to $z^T$. It is defined by the iterative step

$$ z^{t+1} = h_{\alpha \lambda}\left( \alpha \cdot D^{\top} \cdot x + \left( I - \alpha \cdot D^{\top} \cdot D \right) \cdot z^{t} \right) , $$

where $[h_{\theta}(x)]_i = \mathrm{sign}(x_i) \cdot \max(0, |x_i| - \theta)$ and α is a small step-size
parameter. With non-negative units, ISTA is equivalent to projected gradient descent of $L^U$ of
equation 2. As the number of iterations T → ∞, a DrSAE defined by equation 1 becomes a non-negative
version of ISTA if it satisfies the restrictions:

$$ E = \alpha \cdot D^{\top}, \quad S = I - \alpha \cdot D^{\top} \cdot D, \quad b_i = \alpha \cdot \lambda, \quad \text{and} \quad z_i \geq 0 , \qquad (4) $$

where the positive scale factor α is less than the inverse of the maximal eigenvalue of
$D^{\top} \cdot D$, and I is the n × n identity matrix.
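The correspondence between the restricted DrSAE update and non-negative ISTA can be checked numerically. The sketch below is illustrative only: the helper names, the toy dictionary, and the choice of step size (0.9 divided by the largest eigenvalue of DᵀD, a standard sufficient condition for ISTA convergence) are assumptions for the example. It compares each rectified-linear update with an explicit projected gradient step on the sparse reconstruction loss and records the loss after every iteration.

```python
import numpy as np

def ista_restricted_params(D, lam, alpha):
    """Encoder parameters satisfying the ISTA restrictions for a given decoder D."""
    n = D.shape[1]
    E = alpha * D.T
    S = np.eye(n) - alpha * (D.T @ D)
    b = np.full(n, alpha * lam)
    return E, S, b

def loss_u(x, D, z, lam):
    """Sparse reconstruction loss: 0.5 * ||x - D z||_2^2 + lam * ||z||_1."""
    r = x - D @ z
    return 0.5 * (r @ r) + lam * np.abs(z).sum()

# Hypothetical toy problem (sizes and values are illustrative only).
rng = np.random.default_rng(0)
m, n, lam = 6, 10, 0.1
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)                    # unit-norm decoder columns
x = rng.standard_normal(m)
x /= np.linalg.norm(x)
alpha = 0.9 / np.linalg.eigvalsh(D.T @ D).max()   # assumed step size
E, S, b = ista_restricted_params(D, lam, alpha)

z = np.zeros(n)
losses = [loss_u(x, D, z, lam)]
max_gap = 0.0
for _ in range(50):
    # Projected gradient step on the loss, restricted to z >= 0.
    pg = np.maximum(0.0, z - alpha * (D.T @ (D @ z - x) + lam))
    # Rectified-linear recurrent update with the restricted parameters.
    z = np.maximum(0.0, E @ x + S @ z - b)
    max_gap = max(max_gap, np.abs(z - pg).max())
    losses.append(loss_u(x, D, z, lam))
```

Under these assumptions the two updates agree to numerical precision, and the recorded loss does not increase from one iteration to the next.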

As in LISTA, but unlike ISTA, the encoding matrix E and explaining-away matrix S in a DrSAE 
are independent of the decoding matrix D. Connections from the input to the hidden units, and 
recurrent connections between the hidden units, are all-to-all, so the network structure is agnostic to 
permutations of the input. DrSAEs can also be understood as deep, feedforward networks with the 
parameter matrices tied between the layers.



3 Analysis of the hidden unit representation 



Discriminative fine-tuning naturally induces the hidden units of a DrSAE to differentiate into a 
hierarchy-like continuum. On one extreme are part-units, which perform an ISTA-like sparse cod- 
ing computation; on the other are categorical-units, which use a sophisticated form of pooling to 
integrate over matching part-units, and implement winner-take-all dynamics amongst themselves. 
Converging lines of evidence indicate that these two groups use distinct computational mechanisms 
and serve different representational roles. 

In the ISTA algorithm, each row of the encoding matrix $E_i$ (which we sometimes call the encoder
of unit i) is proportional to the corresponding column of the decoding matrix $D_i$ (which we call the
decoder of unit i), and each row $(S - I)_i$ is proportional to $(D_i)^{\top} \cdot D$, as in
equation 4. As a result, the angle between $E_i$ and $D_i$, and the angle between the rows of $S - I$
and $-D^{\top} \cdot D$, are both simple measures of the degree to which a unit's dynamics follow the
ISTA algorithm, and thus perform
Figure 3: The hidden units differentiate into two populations after discriminative fine-tuning. The
magnitude of row $(S - I)_i$ (a,b,c,d) and column $C_i$ (e,f), versus the angle between encoder row
and decoder column, for each unit from networks using 11 (a,c,e) and 2 (b,d,f) iterations. All plots
are from discriminatively fine-tuned networks except (a,b), which are only subject to unsupervised
pretraining. We call the dense cloud in the bottom-left part-units, and the tail extending to the
top-right categorical-units.



sparse coding.⁶ These quantities are equal to 0 in the case of perfect ISTA, and grow larger as the
network diverges from ISTA. Of these two angles, the explaining-away matrix comparison is more
difficult to interpret, since a distortion of any one unit's decoding column will affect all rows of
$-D^{\top} \cdot D$, whereas the angle between the encoder row and decoder column only depends upon
a single unit. For this reason, we use the angle between the encoder row and decoder column as a
measure of the position of each unit on the part/categorical continuum.
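Both angle measures are straightforward to compute from the learned matrices. The following is a minimal sketch with hypothetical function names; the demonstration at the end builds ISTA-restricted parameters from a random toy decoder (an assumption for illustration) to show that both angles vanish in the ideal case.

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    c = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def encoder_decoder_angles(E, D):
    """Angle between encoder row E_i and decoder column D_i, for every unit i."""
    return np.array([angle(E[i], D[:, i]) for i in range(E.shape[0])])

def explaining_away_angles(S, D):
    """Angle between row i of S - I and row i of the ISTA ideal -D^T D."""
    actual = S - np.eye(S.shape[0])
    ideal = -(D.T @ D)
    return np.array([angle(actual[i], ideal[i]) for i in range(S.shape[0])])

# Illustrative check: parameters that obey the ISTA restrictions give zero angles.
rng = np.random.default_rng(0)
D_demo = rng.standard_normal((6, 10))
alpha_demo = 0.05                                   # arbitrary positive scale
E_demo = alpha_demo * D_demo.T
S_demo = np.eye(10) - alpha_demo * (D_demo.T @ D_demo)
ed_angles = encoder_decoder_angles(E_demo, D_demo)
ea_angles = explaining_away_angles(S_demo, D_demo)
```

Applied to a trained network, units with a small `encoder_decoder_angles` value would sit at the part-unit end of the continuum, and units with a large value at the categorical end.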



⁶We always use $S - I$ when plotting recurrent connection strength, since it governs the perturbation
of the otherwise stable hidden unit activations, as in projected gradient descent of $L^U$; i.e., ISTA.
Figure 3 plots, for each unit i, the magnitude of row $(S - I)_i$ and column $C_i$, versus the angle
between row $E_i$ and column $D_i$. Before discriminative fine-tuning, there are no categorical-units;
the angle between the encoder row and decoder column is small and the incoming recurrent
connections are weak for all units, as in figure 3(a,b). After discriminative fine-tuning, there
remains a dense cloud of points for which the angle between the encoder row and decoder column is very
small, and the incoming recurrent and outgoing classification connections are weak. Abutting this
is an extended tail of points that have a larger angle between the encoder row and decoder column,
and stronger incoming recurrent and outgoing classification connections. We call units composing
the dense cloud part-units, since they have ISTA-compatible connections, while we refer to those
making up the extended tail as categorical-units, since they have strong connections to the
classification output.⁷ When trained on MNIST, part-units have localized, pen stroke-like decoders,
as can be seen in the bottom rows of figure 2(a,b). Categorical-units, in contrast, tend to have
whole-digit prototype-like decoders, as in the top rows of figure 2(a,b). Discriminative fine-tuning
induces the differentiation of categorical-units regardless of the depth of the encoder.



3.1 Part-units 



Examination of the relationship between the elements of $S - I$ and $-D^{\top} \cdot D$ confirms that
part-units with an encoder-decoder angle less than 0.5 radians abide by ISTA, and so perform sparse
coding on the residual input after the categorical-unit prototypes are subtracted out. The prominent
diagonals with matching slopes in figure 4(a,b), which plot the value of $S_{i,j} - \delta_{i,j}$
versus $-D_i^{\top} \cdot D_j$ for connections between part-units, and from categorical-units to
part-units, respectively, demonstrate that part-units receive ISTA-consistent connections from all
units. The fidelity of these connections to the ISTA ideal is not strongly dependent upon whether the
afferent units are ISTA-compliant part-units, or ISTA-ignoring categorical-units. As a result, the
part-units treat the categorical-units as if they were also participating in the reconstruction of the
input, and only attempt to reconstruct the residual input not explained by the categorical-unit
prototypes.

As can be seen in figure 4(c), the degree to which the encoder conforms to the ISTA algorithm
is strongly correlated with the degree to which the explaining-away matrix matches the ISTA
algorithm. Figure 5 shows the decoders associated with the strongest recurrent connections to three
representative part-units. As expected, the decoders of these afferent units tend to be strongly
aligned or anti-aligned with their target's decoder, and include both part-units and categorical-units.



3.2 Categorical-units 



In contrast, the recurrent connections to categorical-units with an encoder-decoder angle greater
than 0.7 radians are not strongly correlated with the values predicted by ISTA. Rather than
analyzing connections to the categorical-units only based upon their destination, it is more
informative to consider them organized by their source. Part-units are compatible with
categorical-units of certain classes,⁸ and not with others, as shown by figure 6(a). Part-units
generally have positive connections to categorical-units with parallel prototypes, independent of
offset, and negative connections to categorical-units with orthogonal prototypes, as shown in
figure 7(a). This corresponds to a sophisticated form of pooling (Jarrett, et al., 2009), with a
single categorical-unit drawing excitation from a large collection of parallel but not necessarily
perfectly aligned part-units, as in figure 6(c). It is also suggestive of the standard Hubel and
Wiesel model of complex cells in primary visual cortex (Hubel & Wiesel, 1962). ISTA would instead
predict a connection proportional to the inner product, which is zero for orthogonal prototypes and
negative for anti-aligned prototypes.

Part-units use sparse coding dynamics, and so are not disproportionately suppressed by categorical- 
units that represent any particular class. However, each part-unit is itself compatible with (i.e., has 
positive connections to) categorical-units of only a subset of the classes. As a result, the categorical- 



⁷For the purpose of constructing figures characterizing the difference between part-units and
categorical-units, we consider units with encoder-decoder angle less than 0.5 radians to be part-units,
and units with encoder-decoder angle greater than 0.7 radians to be categorical-units. These thresholds
are heuristic, and fail to reflect the continuum that exists between part- and categorical-units, but
they facilitate analysis of the extremes.

⁸Categorical-units have strong, sparse classification matrix projections, as shown in figures 2(c)
and 3(e,f), and can be identified with the output class to which they have the strongest projection.
Figure 4: Part-units have connections consistent with ISTA. The actual connection weights $S - I$
versus the ISTA-predicted weights $-D^{\top} \cdot D$, for connections from part-units to part-units (a)
and categorical-units to part-units (b); and the angle between the rows of $S - I$ and the ISTA-ideal
$-D^{\top} \cdot D$ versus the angle between the encoder rows and decoder columns (c). Units are
considered part-units if the angle between their encoder and decoder is less than 0.5 radians, and
categorical-units if the angle between their encoder and decoder is greater than 0.7 radians.







Figure 5: Part-units receive ISTA-compatible connections and thus perform sparse coding on the 
residual input after the contribution of the categorical-units is subtracted out. The decoders of the 
twenty units with the strongest explaining-away connections $|S_{i,j} - \delta_{i,j}|$ to three typical part-units,
sorted by connection magnitude. The left-most column depicts the decoder of the recipient part- 
unit. The bars above the decoders in the remaining columns indicate the strength of the connections. 
Black bars are used for positive connections, and white bars for negative connections. 



units and thus the class chosen are determined by the part-unit activations. In particular, only a 
subset of the possible deformations implemented by part-unit decoders are freely available for each 
prototype, since part-units with a strong negative connection to a categorical-unit will tend to silence 
it, and so cannot be used to transform the prototype of that categorical-unit. 

Categorical-units implement winner-take-all-like dynamics amongst themselves, as shown in
figure 6(b), with negative connections to most other categorical-units. Positive total
self-connections $S_{i,i}$ facilitate the integration of inputs over time.
Figure 6: Categorical-units execute a sophisticated form of pooling over part-units, and have
winner-take-all dynamics amongst themselves. The decoders of the categorical-units receiving the
twenty strongest connections $|S_{i,j} - \delta_{i,j}|$ from representative part-units (a) and
categorical-units (b), and the decoders of the part-units sending the twenty strongest projections to
representative categorical-units (c). The connections are sorted first by the class of their
destination, and then by the magnitude of the connection. The left-most column depicts the decoder of
the source (a,b) or destination (c) unit. The bars above the decoders in the remaining columns indicate
the strength of the connections. Black bars are used for positive connections, and white bars for
negative connections.



When activated, the categorical-units make a much larger contribution to the reconstruction than
any single part-unit, as can be seen in figure 7(b). Since the projections from categorical-units
to part-units are consistent with ISTA, the magnitude of the categorical-unit contribution to the
reconstruction need not be tightly regulated. The part-units adjust accordingly to accommodate
whatever residual is left by the categorical-units.

The units form a rough hierarchy, with part-units on the bottom and categorical-units on the top.
Categorical-units receive strong recurrent connections, as shown in figure 3(c,d), implying that their
activity is more determined by other hidden units and less by the input (since the magnitude of the
input connections is bounded), and thus they are higher in the hierarchy. As shown in figure 7(c),
part-units receive most of their input from other part-units; categorical-units receive a larger
fraction of their input from other categorical-units. Whereas part-units have well-structured encoders
and are generally activated directly by the input on the first iteration, categorical-units are more
likely to first achieve a non-zero activation on the second iteration, as shown in figure 7(d),
suggesting that they require stimulation from part-units. The immediate response of part-units in
contrast to the gradual refinement of categorical-units is apparent in figure 8, which shows the
optimal decoding matrix for selected units, inferred from their observed activity at each iteration.
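One way to infer an optimal decoding matrix from observed activity is ordinary least squares over a set of inputs. The sketch below assumes this interpretation; the function name `optimal_decoder` and the column-wise data layout are illustrative choices, and the source does not spell out the exact fitting procedure.

```python
import numpy as np

def optimal_decoder(X, Z):
    """Least-squares decoder for one iteration.

    X: (m, N) array with one input per column.
    Z: (n, N) array with the hidden activity at a fixed iteration t per column.
    Returns the (m, n) matrix D minimizing ||X - D Z||_F^2.
    """
    # lstsq solves Z.T @ D.T = X.T in the least-squares sense.
    D_t, *_ = np.linalg.lstsq(Z.T, X.T, rcond=None)
    return D_t.T

# Illustrative check on synthetic data: a known decoder is recovered exactly.
rng = np.random.default_rng(0)
D_true = rng.standard_normal((4, 3))
Z_demo = np.abs(rng.standard_normal((3, 50)))      # non-negative hidden activity
X_demo = D_true @ Z_demo
D_hat = optimal_decoder(X_demo, Z_demo)
```

Repeating the fit with the hidden activity recorded at each iteration t = 1, ..., T would give one decoder per iteration, whose columns can then be compared across time for a chosen unit.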



4 Performance 

The comparison of MNIST classification performance in table 1 demonstrates the power of the
hierarchical representation learned by DrSAEs. Rather than learn to minimize the sum of equations 2
and 3, Gregor & LeCun (2010) train the LISTA encoder to approximate the code generated by a
traditional sparse coder. While they do not report classification performance using LISTA, Gregor
and LeCun do evaluate MNIST classification error using the related learned coordinate descent
algorithm. Sprechmann, Bronstein, & Sapiro (2012a,b) extend this approach by training a LISTA
auto-encoder to reconstruct the input directly, using loss functions similar to equation 2. Although
they identify the possibility of using regularization dependent upon supervised information,
Sprechmann and colleagues do not consider a parameterized classifier operating on a common hidden
Figure 7: Statistics of connections indicate the presence of a rough hierarchy, with categorical-units
on the top integrating over part-units on the bottom. Average explaining-away connection weight
$S_{i,j}$, binned by alignment between decoders, for connections from part-units to categorical-units
(a). If no units fall in a given bin, the average is set to zero. Average final value of a unit
$z_i^{T}$, given that $z_i^{T} > 0$, versus the angle between the encoder row $E_i$ and decoder column
$D_i$ (b). Average angle between encoder row $E_j$ and decoder column $D_j$ of afferents to unit i,
weighted by the strength of the connection to unit i, versus the angle between encoder row $E_i$ and
decoder column $D_i$ (c). Probability that $z_i^{1} = 0$ and $z_i^{2} > 0$, versus the angle between
the encoder row $E_i$ and decoder column $D_i$ (d). Average value of the decoder column $D_i$ versus
the angle between the encoder row $E_i$ and the decoder column $D_i$ (e).



representation. Instead, they train a separate encoder for each class, and classify each input based
upon the encoder with the lowest sparse coding error. DrSAEs significantly outperform these other
techniques based upon a LISTA encoder.

DrSAEs also perform well compared to other techniques using encoders related to LISTA. Deep
sparse rectifier neural networks (Glorot, Bordes, & Bengio, 2011) combine discriminative training
with an encoder similar to LISTA, but do not tie the parameters between the layers and only allow
the input to project to the first layer. Differentiable sparse coding (Bradley & Bagnell, 2008) and
Figure 8: Part-units (a) respond to the input quickly, while the activity of categorical-units (b)
refines slowly. Columns of the optimal decoding matrices $D^t$ minimizing the input reconstruction
error $\lVert x - D^t \cdot z^t \rVert_2$ from the hidden representation $z^t$ for t = 1, ..., T. The
first and last columns show the corresponding encoder and decoder for the chosen representative units.
Intermediate columns represent successive iterations t.



Encoder (layer sizes; source)                                   Error rate (%)

LISTA auto-encoder, 10 × (289-100^?)                            3.76 (5.98 with 289 hidden units)
  (Sprechmann, Bronstein, & Sapiro, 2012a)
Learned coordinate descent, 784-784^?-10                        2.29
  (Gregor & LeCun, 2010)
Differentiable sparse coding, 180-256*-10                       1.30
  (Bradley & Bagnell, 2008)
Deep sparse rectifier neural network, 784-1000-1000-1000-10     1.20 (1.16 with tanh nonlinearity)
  (Glorot, Bordes, & Bengio, 2011)
Deep belief network, 784-500-500-2000-10                        1.18 (0.92 with dropout)
  (Hinton, et al., 2012)
Discriminative recurrent sparse auto-encoder, 784-400^11-10     1.08 (1.21 with 200 hidden units)
Supervised dictionary learning,                                 1.05 (3.56 without contrastive loss)
  45 × (784-24*) to 45 × (784-96*)
  (Mairal, et al., 2009)



Table 1: MNIST classification error rate (%) for pixel-permutation-agnostic encoders without
boosting-like augmentations. The first column indicates the size of each layer in the specified
encoder, separated by hyphens. Exponents specify the number of recurrent iterations; asterisks denote
repetition to convergence. 10 × (···) indicates that a separate encoder is trained for each input
class; 45 × (···) indicates that a separate encoder is trained for each pairwise binary classification
problem. Further performance improvements have been reported with regularization techniques such as
dropout, architectures that enforce translation-invariance, and datasets augmented by deformations,
as discussed in the main text.



supervised dictionary learning (Mairal, et al., 2009) also train discriminatively, but effectively use
an infinite-depth ISTA-like encoder, and are thus much less computationally efficient than DrSAEs.
Supervised dictionary learning achieves performance statistically indistinguishable from DrSAEs
using a contrastive loss function. A similar technique achieves MNIST classification error as low
as 0.54% when the dataset is augmented with shifted copies of the inputs (Mairal, Bach, & Ponce,
2012).

Additional regularizations and boosting-like techniques can further improve performance of
networks with LISTA-like encoders. Recent examples include dropout, which trains and then averages
over a large set of random subnetworks formed by removing a constant fraction of the hidden units
from the original network (Goodfellow, et al., 2013; Hinton, et al., 2012). Deep belief networks
and deep Boltzmann machines fine-tuned with dropout are the current state-of-the-art for
pixel-permutation-agnostic handwritten digit recognition (Hinton, et al., 2012), and can achieve MNIST
classification error as low as 0.79% with a carefully tuned network structure and multi-step training
procedure. Deep convex networks, which iteratively refine the classification by successively training
a stack of classifiers, with the output of the (i − 1)st classifier provided as input to the ith
classifier, can achieve an MNIST error of 0.83% (Deng & Yu, 2011). Regularizing by explicit modeling
of the data manifold, and then minimizing the square of the Jacobian of the output along the tangent
bundle around the training datapoints, can reduce MNIST error to 0.81% (Rifai, et al., 2011). Further
performance improvements are possible if translation invariance is built directly into the network via
a convolutional architecture, and deformations of the inputs are included in the training set (LeCun,
et al., 1998), yielding error as low as 0.23% (Ciresan, Meier, & Schmidhuber, 2012). These
regularizations and augmentations are potentially compatible with DrSAE, but we defer their
exploration to future work.

Recurrence is essential to the performance of DrSAEs. If the number of recurrent iterations is
decreased from eleven to two, MNIST classification error in a network with 400 hidden units increases
from 1.08% to 1.32%. With only 200 hidden units, MNIST classification error increases from 1.21%
to 1.49%, although the hidden units still differentiate into part-units and categorical-units, as
shown in figure 3(d,f).



5 Discussion 

It is widely believed that natural stimuli, such as images and sounds, fall near a low-dimensional
manifold within a higher-dimensional space (the manifold hypothesis) (Bengio, Courville, & Vincent,
2012; Lee, Pedersen, & Mumford, 2003; Olshausen & Field, 2004). The low-dimensional data
manifold provides an intuitively compelling and empirically effective basis for classification (Rifai,
et al., 2011; Simard, LeCun, & Denker, 1993; Simard, et al., 1998). The continuous deformations
that define the data manifold usually preserve identity, whereas even relatively small invalid
transformations may change the class of a stimulus. For instance, the various handwritten renditions of
the digit 3 in the last column of figure 9(c) barely overlap, and so the Euclidean distance between
them in pixel space is greater than that to the nearest 8 formed by closing both loops. Nevertheless,
smooth deformations of one 3 into another correspond to relatively short trajectories along
the data manifold,¹ whereas the transformation of a 3 into an 8 requires a much longer path within
the data manifold. A prohibitive amount of data is required to fully characterize the data manifold
(Narayanan & Mitter, 2010), so it is often approximated by the set of linear submanifolds tangent to
the data manifold at the observed datapoints, known as the tangent spaces (Ekanadham, Tranchina,
& Simoncelli, 2011; Rifai, et al., 2011; Simard, et al., 1998). DrSAEs naturally and efficiently
form a tangent space-like representation, consisting of a point on the data manifold indicated by the
categorical-units, and a shift within the tangent space specified by the part-units.
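
The tangent-space approximation can be illustrated with a short sketch. This is not the DrSAE
inference procedure; it only fits an input as a prototype plus a least-squares shift within a given
tangent basis, in the spirit of tangent distance. The function name `tangent_fit` and the arguments
`prototype` and `tangents` are hypothetical, and the toy vectors are made up for illustration.

```python
import numpy as np

def tangent_fit(x, prototype, tangents):
    """Fit x as prototype + tangents @ coeffs and return the coefficients and the
    distance from x to that affine tangent space."""
    coeffs, *_ = np.linalg.lstsq(tangents, x - prototype, rcond=None)
    residual = x - (prototype + tangents @ coeffs)
    return coeffs, float(np.linalg.norm(residual))

# Toy example: a prototype at the origin with a single tangent direction.
prototype = np.zeros(3)
tangents = np.array([[1.0], [0.0], [0.0]])   # columns span the tangent space

# A shift along the tangent direction is explained entirely by the coefficients.
coeffs_on, dist_on = tangent_fit(np.array([2.0, 0.0, 0.0]), prototype, tangents)

# A component orthogonal to the tangent space remains as residual distance.
coeffs_off, dist_off = tangent_fit(np.array([2.0, 1.0, 0.0]), prototype, tangents)
```

In this picture the prototype plays the role attributed to the categorical-units and the
coefficients play the role attributed to the part-units.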

Before discriminative fine-tuning, DrSAEs perform a traditional part-based decomposition, familiar
from sparse coding, as shown in figure 9(a). The decoding matrix columns are class-independent,
local pen strokes, and many units make a comparable, small contribution to the reconstruction.
After discriminative fine-tuning, the hidden units differentiate into sparse coding local part-units, and
global prototype categorical-units that integrate over them. As shown in figure 9(b,c), the input is
decomposed into a prototype, corresponding to a point on the data manifold; and a set of deformations
from this prototype along the data manifold, corresponding to shifts within the tangent space. The
same prototype can be used for very different inputs, as demonstrated in figure 9(c), since the space
of deformations is rich enough to encompass diverse transformations without moving off the data
manifold. Even when the prototype is very different from the input, all steps along the reconstruction
trajectories in figure 9(b,c) are recognizable as members of the same class.

¹ In particular, figure 9(c) shows how each input can be produced by identity-preserving deformations from
a common prototype, using the tangent space decomposition produced by our network.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and
deformations along the data manifold. The progressive reconstruction of selected inputs by the
hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right
depict either the components of the reconstruction (top row of each pair), or the partial
reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction
in order of decreasing contribution magnitude; smoother transformations are possible with an
optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the
original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit
activity associated with the column.
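
The progressive reconstructions of figure 9 can be mimicked with a short sketch. It assumes the
reconstruction is the linear decoding D z, and it takes the contribution magnitude of a unit to be
|z_i| times the norm of its decoder column; that measure, the function name
`progressive_reconstruction`, and the toy values are assumptions made for illustration.

```python
import numpy as np

def progressive_reconstruction(D, z):
    """Return active units ordered by decreasing contribution, and the partial
    reconstructions obtained by adding their decoder columns one at a time."""
    contrib = np.abs(z) * np.linalg.norm(D, axis=0)
    order = np.argsort(-contrib)
    order = order[contrib[order] > 0]          # keep only units that contribute
    # Column k of `partial` is the reconstruction from the first k+1 parts.
    partial = np.cumsum(D[:, order] * z[order], axis=1)
    return order, partial

# Toy decoder with three units and two "pixels".
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
z = np.array([1.0, 0.0, 3.0])                  # the second unit is inactive

order, partial = progressive_reconstruction(D, z)
final = partial[:, -1]                          # equals D @ z
```

The last column of `partial` corresponds to the final reconstruction (Fin) in the figure, and the
earlier columns correspond to the intermediate steps.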

The prototypes learned by the categorical-units for each class are not simply the average over the
elements of the class, as depicted in figure 10. Each class includes many possible input variations,
so its average is blurry. The prototypes, in contrast, are sharp, and look like representative elements
of the appropriate class. Many categorical-units are available for each class, as shown in figure 6.
Not all categorical-units correspond to full prototypes; some capture global transformations of a
prototype, such as rotations (Simard, et al., 1998).



Consistent with prototypes for the non-negative MNIST inputs, the decoding matrix columns of 
the categorical-units are generally positive, as shown in figure [Tj^e). In contrast, the decoders of 
the part-units are approximately mean-zero and so cannot serve as prototypes themselves. Rather, 
they shift and transform prototypes, moving activation from one region in the image to another, as 
demonstrated in figure 9(b,c).
Figure 10: The prototypes learned by categorical-units resemble representative instances of the
appropriate class, and are sharper than the average over all members of the class in the dataset. The
left-most column in each group depicts the average over all elements of each of the ten MNIST digit
classes. The other columns show the decoders of the associated units with the largest-magnitude
columns in the classification matrix C. Bars above the decoders indicate the angle between the
encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest
contribution to the classification, and has a large (but not necessarily the largest) angle between its
encoder and decoder. Some units that make large contributions to the classification represent global
transformations, such as rotations, of a prototype (Simard, et al., 1998).



Discrepancies between the prototype and the input due to transformations along the data manifold 
are explained by class-consistent part-units, and only serve to further activate the categorical-units of 
that class, as in figure 6(a,c). Discrepancies between the prototype and the input due to deformations
orthogonal to the data manifold are explained by class-incompatible part-units, and serve to suppress 
the categorical-units of that class, both directly and via activation of incompatible categorical-units. 

If the wrong prototype is turned on, the residual input will generally contain substantial unexplained 
components. Part-units obey ISTA-like dynamics and thus function as a sparse coder on the residual 
input, so part-units that match the unexplained components of the input will be activated. These part- 
units will have positive connections to categorical-units with compatible prototypes, and so will tend 
to activate categorical-units associated with the true class (so long as the unexplained components 
of the input are diagnostic). The spuriously activated categorical-unit will not be able to sustain its 
activity, since few compatible part-units will be required to capture the residual input. 
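
The ISTA-like dynamics of the part-units can be sketched as follows, assuming non-negative sparse
coding with a decoder `D`, step size `eta`, and sparsity penalty `lam` (all illustrative names and
values). One ISTA step on the reconstruction residual is algebraically the same as a rectified
recurrent update with E = eta D^T, S = I - eta D^T D, and b = eta lam. Trained part-units are only
approximately of this form, so this is an idealization rather than the learned network.

```python
import numpy as np

def ista_step(z, x, D, eta, lam):
    """One non-negative ISTA step: gradient step on the residual, then shrinkage."""
    residual = x - D @ z                        # part of the input not yet explained
    return np.maximum(0.0, z + eta * D.T @ residual - eta * lam)

def recurrent_step(z, x, D, eta, lam):
    """The same step written as a recurrent ReLU update max(0, E x + S z - b)."""
    E = eta * D.T                               # input-to-hidden weights
    S = np.eye(D.shape[1]) - eta * D.T @ D      # hidden-to-hidden weights
    b = eta * lam                               # uniform bias (threshold)
    return np.maximum(0.0, E @ x + S @ z - b)

# Placeholder data, for illustration only.
rng = np.random.default_rng(1)
D = rng.standard_normal((12, 6))
x = rng.random(12)
z = rng.random(6)

z_ista = ista_step(z, x, D, eta=0.05, lam=0.1)
z_rec = recurrent_step(z, x, D, eta=0.05, lam=0.1)
```

Because the update depends on the input only through the residual, a unit whose decoder matches an
unexplained component of the input receives positive drive, which is the behavior described above.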

The classification approach used by DrSAEs is different from one based upon a traditional sparse 
coding decomposition: it projects into the space of deviations from a prototype, which is not the 
same as the space of prototype-free parts, as is clear from figure 9(a,b). For instance, a 5 can easily
be constructed using the parts of a 6, making it difficult to distinguish the two. Indeed, the first seven
progressive reconstruction steps of the 6 in figure 9(a) could just as easily be used to produce a 5.
However, starting from a 6 prototype, the parts required to break the bottom loop are outside the 
data manifold of the 6 class, and so will tend to change the active prototype. 

DrSAEs naturally learn a hierarchical representation within a recurrent network, thereby
implementing a deep network with parameter sharing between the layers.